Conversation

Walkthrough

Introduces a new merge module in arrow-select for array merging operations, including the MergeIndex trait and the merge_n/merge functions. Adds benchmark infrastructure for the merge kernel. Re-exports merge functionality from the main arrow crate's compute kernels module.
Code Review for PR #17: Add
| Category | Rating | Notes |
|---|---|---|
| Code Quality | ⭐⭐⭐⭐ | Excellent structure and documentation |
| Test Coverage | ⭐⭐⭐ | Good basic tests, needs more edge cases |
| Performance | ⭐⭐⭐⭐⭐ | Well-optimized with comprehensive benchmarks |
| Security | ⭐⭐⭐⭐⭐ | No concerns |
| API Design | ⭐⭐⭐⭐⭐ | Clean, extensible design |
🎯 Recommendations

High Priority (should fix before merge):
1. Fix empty values array panic in merge_n
2. Fix capacity hint in merge (truthy.len() → mask.len())
3. Fix benchmark parameter order

Medium Priority (nice to have):
4. Change # Safety to # Panics
5. Add edge case tests (empty mask, nulls in mask, etc.)
6. Fix documentation reference ([merge] → [merge_n])

Low Priority:
7. Verify take_offsets size (is the + 1 necessary?)
Great work overall! The implementation is solid and the benchmarking is particularly thorough. Once the high-priority items are addressed, this will be ready to merge.
Let me know if you have any questions or would like clarification on any of these points!
```rust
    &masks,
    &array_1_10pct_nulls,
    &non_null_scalar_1,
);
```
Bug: Benchmark Logic Inverted, Yielding Misleading Data
The benchmark "non_null_scalar_vs_array" swaps the arguments, passing the array as truthy and the scalar as falsy, when the name suggests it should be the opposite. This makes the benchmark measure the wrong scenario and produces misleading performance data.
value:bug; category:documentation; feedback:The Bugbot AI reviewer is correct that both unit tests do the very same. The second unit test needs to swap its last two arguments. The finding prevents regressions due to untested paths.
```rust
let falsy = falsy_array.to_data();
let truthy = truthy_array.to_data();

let mut mutable = MutableArrayData::new(vec![&truthy, &falsy], false, truthy.len());
```
Bug: Suboptimal Memory Allocation for Array Data
The capacity hint for MutableArrayData::new uses truthy.len() but should be mask.len() since the output array length equals the mask length, not the truthy array length. This causes incorrect memory allocation sizing and potential unnecessary reallocations.
value:good-to-have; category:bug; feedback:The Bugbot AI reviewer is correct that the capacity is set to a lower value than the number of items that are going to be inserted later. Prevents resizing of the structure during its lifetime.
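The impact of the capacity hint is easy to demonstrate with a plain std `Vec`, used here as a stand-in for `MutableArrayData` (whose constructor takes an analogous capacity argument): a buffer pre-sized to the output length never reallocates while being filled.

```rust
// Sketch with a plain std Vec (a stand-in for MutableArrayData, whose
// constructor takes an analogous capacity hint): pre-sizing the buffer
// to the output length (mask.len()) avoids reallocation during the build.
fn main() {
    let mask_len = 1000; // the output length equals the mask length
    let mut out: Vec<u8> = Vec::with_capacity(mask_len);
    let p0 = out.as_ptr();
    for i in 0..mask_len {
        out.push((i % 251) as u8);
    }
    // The buffer pointer is unchanged: no reallocation occurred.
    assert_eq!(p0, out.as_ptr());
    assert_eq!(out.len(), mask_len);
}
```

With the undersized hint (`truthy.len()`), the builder would instead grow its buffers partway through, which is exactly the resizing the reviewer flags.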
````rust
/// ```
pub fn merge_n(values: &[&dyn Array], indices: &[impl MergeIndex]) -> Result<ArrayRef, ArrowError> {
    let data_type = values[0].data_type();
````
value:useful; category:bug; feedback:The Bugbot AI reviewer is correct that there is no check for non-empty values array before trying to use the first item from it. The finding prevents a panic caused by out of bounds access.
````rust
/// ```
pub fn merge_n(values: &[&dyn Array], indices: &[impl MergeIndex]) -> Result<ArrayRef, ArrowError> {
    let data_type = values[0].data_type();
````
merge_n indexes values[0] without checking for an empty values slice; calling this with no inputs will panic. Consider validating values.is_empty() and returning an error up front.
value:useful; category:bug; feedback:The Augment AI reviewer is correct that there is no check for non-empty values array before trying to use the first item from it. The finding prevents a panic caused by out of bounds access.
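The suggested guard is small; below is a minimal plain-Rust sketch of the pattern (the `first_data_type` helper and `&str` inputs are hypothetical stand-ins for the arrow types, not the actual API):

```rust
// Hypothetical stand-in for the real guard: slice::first turns the
// empty-input case into an Err instead of a panic on values[0].
fn first_data_type(values: &[&str]) -> Result<String, String> {
    let first = values
        .first()
        .ok_or_else(|| "merge_n requires at least one input array".to_string())?;
    Ok(first.to_string())
}

fn main() {
    assert!(first_data_type(&[]).is_err());
    assert_eq!(first_data_type(&["Int32", "Int32"]).unwrap(), "Int32");
}
```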
```rust
    return Ok(new_empty_array(data_type));
}

#[cfg(debug_assertions)]
```
Out-of-bounds indices are only checked under debug_assertions; in release builds an invalid index can panic when used by MutableArrayData::extend. Consider validating indices (or documenting this invariant) for non-debug builds as well.
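One way to address this, sketched below in plain Rust (a hypothetical helper, not the crate's API), is to validate the index array once up front and return an error, so release builds fail gracefully instead of panicking inside the extend loop.

```rust
// Hypothetical up-front validation pass: reject any index that is out of
// range for the number of source arrays, instead of relying on a
// debug_assertions-only check that disappears in release builds.
fn validate_indices(indices: &[usize], n_arrays: usize) -> Result<(), String> {
    for (pos, &ix) in indices.iter().enumerate() {
        if ix >= n_arrays {
            return Err(format!(
                "index {ix} at position {pos} is out of bounds for {n_arrays} arrays"
            ));
        }
    }
    Ok(())
}

fn main() {
    assert!(validate_indices(&[0, 1, 0], 2).is_ok());
    assert!(validate_indices(&[0, 2], 2).is_err());
}
```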
```rust
);

bench_merge_input_on_all_masks(
    "non_null_scalar_vs_array",
```
The "non_null_scalar_vs_array" case passes array_1_10pct_nulls as truthy and non_null_scalar_1 as falsy (same order as "array_vs_non_null_scalar"), so both benchmarks measure the same input order. Consider swapping the arguments or adjusting the label.
value:bug; category:documentation; feedback:The Augment AI reviewer is correct that both unit tests do the very same. The second unit test needs to swap its last two arguments. The finding prevents regressions due to untested paths.
Actionable comments posted: 3
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (5)
- arrow-select/src/lib.rs (1 hunks)
- arrow-select/src/merge.rs (1 hunks)
- arrow/Cargo.toml (1 hunks)
- arrow/benches/merge_kernels.rs (1 hunks)
- arrow/src/compute/kernels.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)

arrow-select/src/lib.rs (1)
- arrow-select/src/merge.rs (1): merge (208-303)

arrow-select/src/merge.rs (2)
- arrow-select/src/filter.rs (1): prep_null_mask_filter (122-126)
- arrow-array/src/array/mod.rs (1): new_empty_array (856-859)

arrow/benches/merge_kernels.rs (2)
- arrow-array/src/array/mod.rs (1): new_null_array (874-876)
- arrow/src/util/bench_util.rs (1): create_boolean_array (103-118)
⏰ Context from checks skipped due to timeout of 90000ms (3). You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms).
- GitHub Check: Cursor Bugbot
- GitHub Check: codex
- GitHub Check: claude-review
```rust
let data_type = values[0].data_type();

for array in values.iter().skip(1) {
    if array.data_type() != data_type {
        return Err(ArrowError::InvalidArgumentError(format!(
            "It is not possible to merge arrays of different data types ({} and {})",
            data_type,
            array.data_type()
        )));
    }
}

if indices.is_empty() {
    return Ok(new_empty_array(data_type));
}

#[cfg(debug_assertions)]
```
Guard against empty values.

Line 111 dereferences values[0]; calling merge_n(&[], …) panics before we can return an error. Please reject empty values up front with an InvalidArgumentError, e.g.:

```diff
+ if values.is_empty() {
+     return Err(ArrowError::InvalidArgumentError(
+         "merge_n requires at least one input array".into(),
+     ));
+ }
+
  let data_type = values[0].data_type();
```

This keeps the kernel from aborting on malformed input.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```rust
if values.is_empty() {
    return Err(ArrowError::InvalidArgumentError(
        "merge_n requires at least one input array".into(),
    ));
}

let data_type = values[0].data_type();

for array in values.iter().skip(1) {
    if array.data_type() != data_type {
        return Err(ArrowError::InvalidArgumentError(format!(
            "It is not possible to merge arrays of different data types ({} and {})",
            data_type,
            array.data_type()
        )));
    }
}

if indices.is_empty() {
    return Ok(new_empty_array(data_type));
}

#[cfg(debug_assertions)]
```
🤖 Prompt for AI Agents
In arrow-select/src/merge.rs around lines 110 to 126, the function dereferences
values[0] without checking for an empty slice which causes a panic; add an
upfront guard that returns Err(ArrowError::InvalidArgumentError("merge_n
requires at least one input array".to_string())) (or similar descriptive
message) if values.is_empty(), placing this check before reading values[0]; keep
existing type-equality loop and subsequent logic unchanged.
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct that there is no check for non-empty values array before trying to use the first item from it. The finding prevents a panic caused by out of bounds access.
@martin-augment, thank you for confirming! I'm glad the empty values array check will help prevent that panic. The guard will ensure the function returns a proper error instead of crashing on out-of-bounds access. 🎉
```rust
// This loop extends the mutable array by taking slices from the partial results.
//
// take_offsets keeps track of how many values have been taken from each array.
let mut take_offsets = vec![0; values.len() + 1];
let mut start_row_ix = 0;
loop {
    let array_ix = indices[start_row_ix];

    // Determine the length of the slice to take.
    let mut end_row_ix = start_row_ix + 1;
    while end_row_ix < indices.len() && indices[end_row_ix] == array_ix {
        end_row_ix += 1;
    }
    let slice_length = end_row_ix - start_row_ix;

    // Extend mutable with either nulls or with values from the array.
    match array_ix.index() {
        None => mutable.extend_nulls(slice_length),
        Some(index) => {
            let start_offset = take_offsets[index];
            let end_offset = start_offset + slice_length;
            mutable.extend(index, start_offset, end_offset);
            take_offsets[index] = end_offset;
        }
    }

    if end_row_ix == indices.len() {
```
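The heart of the quoted loop is run-length grouping of consecutive equal indices, so that each run becomes a single extend call. A standalone sketch of just that grouping (plain Rust; `Option<usize>` stands in for the `MergeIndex` trait, where `None` means "emit nulls" and `Some(i)` means "take from values[i]"):

```rust
// Standalone sketch of the run-length grouping the quoted loop performs.
// Returns (source, run length) pairs for each maximal run of equal indices.
fn runs(indices: &[Option<usize>]) -> Vec<(Option<usize>, usize)> {
    let mut out = Vec::new();
    let mut start = 0;
    while start < indices.len() {
        let current = indices[start];
        // Advance past every consecutive index equal to the current one.
        let mut end = start + 1;
        while end < indices.len() && indices[end] == current {
            end += 1;
        }
        out.push((current, end - start));
        start = end;
    }
    out
}

fn main() {
    let ix = [Some(0), Some(0), None, Some(1), Some(1), Some(1)];
    assert_eq!(runs(&ix), vec![(Some(0), 2), (None, 1), (Some(1), 3)]);
}
```

Batching per run rather than per row is what makes the kernel cheap when the index array contains long stretches from the same source.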
Validate indices in release builds.

Currently only the debug assert on Lines 128-134 checks index < values.len(). In release builds, an out-of-range index makes take_offsets[index] and mutable.extend(index, …) panic. Please perform the bounds check just before extending and return an InvalidArgumentError instead:

```diff
  Some(index) => {
+     if index >= values.len() {
+         return Err(ArrowError::InvalidArgumentError(format!(
+             "index out of bounds: {} >= {}",
+             index,
+             values.len()
+         )));
+     }
      let start_offset = take_offsets[index];
      let end_offset = start_offset + slice_length;
```

This keeps malformed index arrays from crashing the process.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```rust
// This loop extends the mutable array by taking slices from the partial results.
//
// take_offsets keeps track of how many values have been taken from each array.
let mut take_offsets = vec![0; values.len() + 1];
let mut start_row_ix = 0;
loop {
    let array_ix = indices[start_row_ix];

    // Determine the length of the slice to take.
    let mut end_row_ix = start_row_ix + 1;
    while end_row_ix < indices.len() && indices[end_row_ix] == array_ix {
        end_row_ix += 1;
    }
    let slice_length = end_row_ix - start_row_ix;

    // Extend mutable with either nulls or with values from the array.
    match array_ix.index() {
        None => mutable.extend_nulls(slice_length),
        Some(index) => {
            if index >= values.len() {
                return Err(ArrowError::InvalidArgumentError(format!(
                    "index out of bounds: {} >= {}",
                    index,
                    values.len()
                )));
            }
            let start_offset = take_offsets[index];
            let end_offset = start_offset + slice_length;
            mutable.extend(index, start_offset, end_offset);
            take_offsets[index] = end_offset;
        }
    }

    if end_row_ix == indices.len() {
```
🤖 Prompt for AI Agents
In arrow-select/src/merge.rs around lines 142 to 169, the code assumes the
extracted `index` is always in-range and uses it to index `take_offsets` and
call `mutable.extend`, which will panic in release builds; before using `index`,
perform an explicit bounds check (verify `index < values.len()` and `index <
take_offsets.len()` / equivalent) and if the check fails return an
InvalidArgumentError with a clear message about the malformed indices instead of
proceeding, otherwise continue to use `index` as now and update `take_offsets`.
value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct that the assertions are executed only in non-production builds. Prevents out of bounds index access issues in production builds.
@martin-augment Thank you for confirming! I'm glad we caught this before it could cause production issues. Adding the explicit bounds check will ensure that malformed indices return a proper error instead of panicking. 🐰
```rust
bench_merge_input_on_all_masks(
    "non_null_scalar_vs_array",
    &mut group,
    &masks,
    &array_1_10pct_nulls,
    &non_null_scalar_1,
);
```
Fix the “non_null_scalar_vs_array” benchmark inputs.
Line 196 still feeds the array on the truthy side, making this case identical to “array_vs_non_null_scalar” and never exercising the scalar-then-array path. Please swap the arguments so the truthy input is the scalar and the falsy input is the array (e.g., stash let non_null_truthy = Scalar::new(&non_null_scalar_1); and pass &non_null_truthy before the array). This ensures the benchmark covers both asymmetric permutations instead of duplicating the same scenario.
🤖 Prompt for AI Agents
In arrow/benches/merge_kernels.rs around lines 194 to 201, the
"non_null_scalar_vs_array" benchmark currently passes the array as the truthy
input, duplicating the "array_vs_non_null_scalar" scenario; change it so the
truthy input is the scalar and the falsy input is the array: create a scalar
wrapper (e.g., let non_null_truthy = Scalar::new(&non_null_scalar_1);) and pass
&non_null_truthy before &array_1_10pct_nulls when calling
bench_merge_input_on_all_masks so the benchmark exercises the scalar-then-array
permutation.
value:bug; category:documentation; feedback:The CodeRabbit AI reviewer is correct that both unit tests do the very same. The second unit test needs to swap its last two arguments. The finding prevents regressions due to untested paths.
@martin-augment, thank you for confirming the finding! You're absolutely right that having both permutations tested is essential for preventing regressions in untested code paths. I appreciate you catching this before it made it into the codebase.
value:useful; category:bug; feedback:The Claude AI reviewer is correct that there is no check for non-empty values array before trying to use the first item from it. The finding prevents a panic caused by out of bounds access.
value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct that the capacity is set to a lower value than the number of items that are going to be inserted later. Prevents resizing of the structure during its lifetime.
value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct that there is no unit test for the case where the mask is empty. The finding increases the code coverage and would prevent regressions in the future.
value:good-to-have; category:documentation; feedback:The Claude AI reviewer is correct that the paragraph talks about panics, not about safety, so it should be renamed. Prevents confusion of the reader.
value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct that the capacity of the vector might be one item too big. Prevents allocating more memory than needed.
value:good-to-have; category:documentation; feedback:The Claude AI reviewer is correct that the documentation refers to the wrong function. The finding prevents confusion in the reader of the documentation.
value:bug; category:documentation; feedback:The Claude AI reviewer is correct that both unit tests do the very same. The second unit test needs to swap its last two arguments. The finding prevents regressions due to untested paths.
value:bug; category:documentation; feedback:The Codex AI reviewer is correct that there is no check for equal lengths of the non-scalar inputs that may lead to panic due to out of bounds index access.
value:bug; category:documentation; feedback:The Codex AI reviewer is correct that both unit tests do the very same. The second unit test needs to swap its last two arguments. The finding prevents regressions due to untested paths.
8753: To review by AI
Note

Introduces merge and merge_n selection kernels, re-exports them in compute::kernels, and adds comprehensive benchmarks.

- New kernels (arrow-select): merge.rs with merge (mask-driven two-source merge with scalar support) and merge_n (multi-array index-based merge) plus tests; exported from src/lib.rs.
- Re-exports (arrow/src/compute/kernels.rs): merge alongside existing kernels.
- Benchmarks (arrow): benches/merge_kernels.rs benchmarking primitives, strings, and binaries across mask patterns; updates Cargo.toml to include a [[bench]] entry for merge_kernels.

Written by Cursor Bugbot for commit 9bb40cc. This will update automatically on new commits. Configure here.
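For orientation, here is a plain-Rust sketch of what a mask-driven two-source merge does, assuming each source is consumed sequentially as the take_offsets bookkeeping discussed above suggests (this is illustrative semantics only, not the arrow API; `merge_values` is a hypothetical name):

```rust
// Hypothetical sketch of mask-driven merge semantics: for each mask slot,
// take the next unconsumed value from the truthy source when the slot is
// true, otherwise from the falsy source. The output length equals mask.len().
fn merge_values<T: Copy>(mask: &[bool], truthy: &[T], falsy: &[T]) -> Vec<T> {
    let (mut ti, mut fi) = (0, 0);
    mask.iter()
        .map(|&m| {
            if m {
                let v = truthy[ti];
                ti += 1;
                v
            } else {
                let v = falsy[fi];
                fi += 1;
                v
            }
        })
        .collect()
}

fn main() {
    assert_eq!(merge_values(&[true, false, true], &[1, 3], &[2]), vec![1, 2, 3]);
}
```

Note that this sketch also makes the capacity discussion above concrete: the output has one slot per mask entry, which is why the builder should be sized to the mask length rather than to one source's length.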